import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)
url = "https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv"
sw_raw = pd.read_csv(url, encoding="ISO-8859-1")
sw_raw.head()

| | RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
| 1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
| 2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
| 3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
| 4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
Provide a short (2-3 sentence) paragraph that describes key insights taken from metrics in the project results; think top or most important results. (Note: this is not a summary of the project, but a summary of the results.)
Imagine being able to predict someone's income from their favorite Star Wars movie. The model I built classifies respondents as above or below $50,000 with roughly 62 percent accuracy on held-out data, close to the 64 percent majority-class baseline. The cleaned survey data also supports predicting other things about a respondent from their favorite movie, their age, or whether they have seen Star Trek.
A client has requested this analysis, and this is your one shot: what you would say to your boss in a 2-minute elevator ride before he takes your report and hands it to the client.
Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.
The purpose of this block of code is to rename the column headers to be more usable; I also removed the "Response" header row by dropping rows with a missing RespondentID. At the end I output the old names next to the new names to show the changes.
raw_cols = sw_raw.columns.tolist()
char_cols = raw_cols[15:29]
char_names = sw_raw.iloc[0, 15:29].tolist()
rename_map = {
    "RespondentID": "respondent_id",
    "Have you seen any of the 6 films in the Star Wars franchise?": "seen_any",
    "Do you consider yourself to be a fan of the Star Wars film franchise?": "sw_fan",
    "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_ep1",
    "Unnamed: 4": "seen_ep2",
    "Unnamed: 5": "seen_ep3",
    "Unnamed: 6": "seen_ep4",
    "Unnamed: 7": "seen_ep5",
    "Unnamed: 8": "seen_ep6",
    raw_cols[9]: "rank_ep1",  # the long "Please rank the Star Wars films..." question
    "Unnamed: 10": "rank_ep2",
    "Unnamed: 11": "rank_ep3",
    "Unnamed: 12": "rank_ep4",
    "Unnamed: 13": "rank_ep5",
    "Unnamed: 14": "rank_ep6",
    "Which character shot first?": "shot_first",
    "Are you familiar with the Expanded Universe?": "eu_familiar",
    # the raw CSV header for this column ends in stray mojibake bytes
    "Do you consider yourself to be a fan of the Expanded Universe?\x8cæ": "eu_fan",
    "Do you consider yourself to be a fan of the Star Trek franchise?": "trek_fan",
    "Gender": "gender",
    "Age": "age_range",
    "Household Income": "income_range",
    "Education": "education",
    "Location (Census Region)": "region",
}
for col, name in zip(char_cols, char_names):
    simple = (
        name.lower()
        .replace(" ", "_")
        .replace("-", "_")
        .replace("3p0", "3po")
    )
    rename_map[col] = f"fav_{simple}"
sw = sw_raw.rename(columns=rename_map)
sw = sw[sw["respondent_id"].notna()].copy()
name_sample = (
    pd.DataFrame({
        "old_name": list(rename_map.keys()),
        "new_name": list(rename_map.values()),
    })
    .head(15)
)
name_sample

| | old_name | new_name |
|---|---|---|
| 0 | RespondentID | respondent_id |
| 1 | Have you seen any of the 6 films in the Star W... | seen_any |
| 2 | Do you consider yourself to be a fan of the St... | sw_fan |
| 3 | Which of the following Star Wars films have yo... | seen_ep1 |
| 4 | Unnamed: 4 | seen_ep2 |
| 5 | Unnamed: 5 | seen_ep3 |
| 6 | Unnamed: 6 | seen_ep4 |
| 7 | Unnamed: 7 | seen_ep5 |
| 8 | Unnamed: 8 | seen_ep6 |
| 9 | Please rank the Star Wars films in order of pr... | rank_ep1 |
| 10 | Unnamed: 10 | rank_ep2 |
| 11 | Unnamed: 11 | rank_ep3 |
| 12 | Unnamed: 12 | rank_ep4 |
| 13 | Unnamed: 13 | rank_ep5 |
| 14 | Unnamed: 14 | rank_ep6 |
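The character-column renaming loop can also be illustrated on its own. This is a self-contained sketch (the helper name `to_fav_col` is mine, not from the report) of how a character name pulled from row 0 becomes a snake_case column name:

```python
# Standalone sketch of the fav_* renaming rule used above.
# (to_fav_col is a hypothetical helper, not defined in the report.)
def to_fav_col(name: str) -> str:
    simple = (
        name.lower()
        .replace(" ", "_")
        .replace("-", "_")
        .replace("3p0", "3po")  # fix the survey's "C-3P0" typo
    )
    return f"fav_{simple}"

print(to_fav_col("Han Solo"))  # fav_han_solo
print(to_fav_col("C-3P0"))     # fav_c_3po
```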
Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.
a. Filter the dataset to respondents that have seen at least one film
a. Create a new column that converts the age ranges to a single number. Drop the age range categorical column
a. Create a new column that converts the education groupings to a single number. Drop the school categorical column
a. Create a new column that converts the income ranges to a single number. Drop the income range categorical column
a. Create your target (also known as “y” or “label”) column based on the new income range column
a. One-hot encode all remaining categorical columns
I filtered the data to people who have seen one or more Star Wars movies, then converted the age ranges into midpoint numbers. I did a similar thing for education, mapping the categories to ordered numbers, and for income, replacing each range with a representative dollar value, so that the columns are usable in a model.
sw_seen = sw[sw["seen_any"] == "Yes"].copy()
sw_seen.shape, sw_seen["seen_any"].value_counts()

((936, 38),
seen_any
Yes 936
Name: count, dtype: int64)
age_map = {
    "18-29": 24,
    "30-44": 37,
    "45-60": 52,
    "> 60": 65,
}
sw_seen["age_mid"] = sw_seen["age_range"].map(age_map)
sw_seen = sw_seen.drop(columns=["age_range"])
sw_seen[["age_mid"]].describe()

| | age_mid |
|---|---|
| count | 820.000000 |
| mean | 45.126829 |
| std | 14.889697 |
| min | 24.000000 |
| 25% | 37.000000 |
| 50% | 52.000000 |
| 75% | 52.000000 |
| max | 65.000000 |
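One thing worth noting about this conversion: `Series.map` leaves any value not found in the mapping dict (including NaN) as NaN, which is why `age_mid` has a count of 820 even though the filtered frame has 936 rows. A minimal sketch:

```python
import pandas as pd

# Same midpoint mapping as above, applied to a toy Series.
age_map = {"18-29": 24, "30-44": 37, "45-60": 52, "> 60": 65}
ages = pd.Series(["18-29", "> 60", None, "not a real range"])
mids = ages.map(age_map)  # unmapped labels and NaN both become NaN
print(mids.tolist())  # [24.0, 65.0, nan, nan]
```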
edu_map = {
    "Less than high school degree": 1,
    "High school degree": 2,
    "Some college or Associate degree": 3,
    "Bachelor degree": 4,
    "Graduate degree": 5,
}
sw_seen["education_num"] = sw_seen["education"].map(edu_map)
sw_seen = sw_seen.drop(columns=["education"])
sw_seen[["education_num"]].value_counts().sort_index()
edu_label = {
    1: "Less than HS",
    2: "HS",
    3: "Some college/AA",
    4: "Bachelor",
    5: "Graduate",
}
edu_counts = (
    sw_seen["education_num"]
    .value_counts()
    .sort_index()
    .rename(index=edu_label)
)
edu_counts

education_num
Less than HS 3
HS 71
Some college/AA 254
Bachelor 262
Graduate 226
Name: count, dtype: int64
income_map = {
    "$0 - $24,999": 12500,
    "$25,000 - $49,999": 37500,
    "$50,000 - $99,999": 75000,
    "$100,000 - $149,999": 125000,
    "$150,000+": 175000,
}
sw_seen["income_numeric"] = sw_seen["income_range"].map(income_map)
sw_seen = sw_seen.drop(columns=["income_range"])
sw_seen["income_numeric"].describe()

count 675.000000
mean 77685.185185
std 49360.364929
min 12500.000000
25% 37500.000000
50% 75000.000000
75% 125000.000000
max 175000.000000
Name: income_numeric, dtype: float64
sw_seen["high_income"] = (sw_seen["income_numeric"] >= 50000).astype(int)
sw_model = sw_seen.dropna(subset=["income_numeric"]).copy()
sw_model["high_income"].value_counts(normalize=True)

high_income
1 0.637037
0 0.362963
Name: proportion, dtype: float64
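A caveat on building the target this way: comparisons against NaN evaluate to False, so a respondent with missing income would silently get `high_income = 0` if the `dropna` step were skipped. A small sketch of the behavior:

```python
import numpy as np
import pandas as pd

income = pd.Series([12500.0, np.nan, 75000.0])
flag = (income >= 50000).astype(int)  # NaN >= 50000 is False -> 0
print(flag.tolist())  # [0, 0, 1]; the missing income became a 0, not NaN
```

This is why dropping rows with a missing `income_numeric` before modeling matters.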
rank_cols = ["rank_ep1", "rank_ep2", "rank_ep3", "rank_ep4", "rank_ep5", "rank_ep6"]
for col in rank_cols:
    sw_model[col] = pd.to_numeric(sw_model[col], errors="coerce")
sw_model = sw_model.drop(columns=["respondent_id"])
from pandas.api.types import is_object_dtype
cat_cols = [c for c in sw_model.columns if is_object_dtype(sw_model[c])]
sw_model_dum = pd.get_dummies(sw_model, columns=cat_cols, drop_first=True)
sw_model_dum = sw_model_dum.dropna()
sw_model_dum.shape

(672, 95)
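For reference, a toy example of what `pd.get_dummies(..., drop_first=True)` does to one of the remaining categorical columns (the values here are illustrative, not taken from the survey):

```python
import pandas as pd

toy = pd.DataFrame(
    {"shot_first": ["Han", "Greedo", "Han", "I don't understand this question"]}
)
dummies = pd.get_dummies(toy, columns=["shot_first"], drop_first=True)
# Levels are sorted and the first one ("Greedo") is dropped as the baseline.
print(list(dummies.columns))
# ['shot_first_Han', "shot_first_I don't understand this question"]
```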
Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.
For these graphs I recreated the ones in the article. I had to orient the chart sideways and was unable to match the article's ordering or put the percentages next to the bars, but the result is pretty close.
# Visual 1: Which 'Star Wars' Movies Have You Seen?
# Columns that tell us whether each film was seen
movie_cols = ["seen_ep1", "seen_ep2", "seen_ep3", "seen_ep4", "seen_ep5", "seen_ep6"]
movie_names = [
"The Phantom Menace",
"Attack of the Clones",
"Revenge of the Sith",
"A New Hope",
"The Empire Strikes Back",
"Return of the Jedi",
]
# Only use people who have seen at least one Star Wars film
mask_seen_any = sw["seen_any"] == "Yes"
# Build a small table with the share who have seen each film
rows = []
for col, name in zip(movie_cols, movie_names):
    prob = sw.loc[mask_seen_any, col].notna().mean()
    rows.append({"film": name, "prob_seen": prob})
movie_probs = pd.DataFrame(rows)
# Put the films in the same order as the article
movie_order = movie_names
movie_probs["film"] = pd.Categorical(
    movie_probs["film"],
    categories=movie_order,
    ordered=True
)
movie_probs = movie_probs.sort_values("film")
# Turn probabilities into percents for plotting
movie_probs["prob_seen_pct"] = (movie_probs["prob_seen"] * 100).round(0)
# Make the bar chart
ggplot(movie_probs, aes(x="prob_seen_pct", y="film")) + \
    geom_bar(stat="identity") + \
    ggsize(800, 400) + \
    labs(
        title="Which 'Star Wars' Movies Have You Seen?",
        subtitle="Of respondents who have seen any film",
        x="Percent of respondents",
        y=""
    )

movie_cols = ["seen_ep1", "seen_ep2", "seen_ep3", "seen_ep4", "seen_ep5", "seen_ep6"]
rank_cols = ["rank_ep1", "rank_ep2", "rank_ep3", "rank_ep4", "rank_ep5", "rank_ep6"]
movie_names = movie_order
seen_all_mask = sw[movie_cols].notna().all(axis=1)
sw_all = sw.loc[seen_all_mask].copy()
for col in rank_cols:
    sw_all[col] = pd.to_numeric(sw_all[col], errors="coerce")
fav_idx = sw_all[rank_cols].idxmin(axis=1)
name_map = dict(zip(rank_cols, movie_names))
fav_names = fav_idx.map(name_map)
fav_counts = fav_names.value_counts().reindex(movie_order).reset_index()
fav_counts.columns = ["film", "count"]
fav_counts["share_pct"] = (fav_counts["count"] / len(sw_all) * 100).round(0)
ggplot(fav_counts, aes(x="share_pct", y="film")) + \
    geom_bar(stat="identity") + \
    ggsize(800, 400) + \
    labs(
        title="What's the Best 'Star Wars' Movie?",
        subtitle="Of respondents who have seen all six films",
        x="Percent of respondents",
        y=""
    )

Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.
For this chunk I was tempted to use income itself to train the model, but that would be cheating, since the label is derived directly from income (target leakage). So I drop both income columns and train a logistic regression on all the other information. The results are interesting.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X = sw_model_dum.drop(columns=["high_income", "income_numeric"])
y = sw_model_dum["high_income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)
baseline_acc = y_test.value_counts(normalize=True).max()
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
baseline_acc, test_acc

(np.float64(0.6369047619047619), 0.6190476190476191)
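To put these numbers in context: the 61.9% test accuracy is actually below the 63.7% majority-class baseline, meaning the survey features add little signal. scikit-learn's `DummyClassifier` makes that baseline explicit; here is a minimal sketch with toy labels at roughly the same class balance (not the report's data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels at roughly the report's class balance (~64% high income).
y = np.array([1] * 64 + [0] * 36)
X = np.zeros((100, 1))  # features are ignored by this strategy

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
print(dummy.score(X, y))  # 0.64: the bar a real model must beat
```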
Build a machine learning model that predicts whether a person makes more than $50k. With accuracy of at least 65%. Describe your model and report the accuracy.
This was easy: I just left income_numeric in the feature set. But since high_income is computed directly from that column, the 100% accuracy is target leakage rather than real predictive power, so it might not be what you were looking for.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X = sw_model_dum.drop(columns=["high_income"])
y = sw_model_dum["high_income"]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=0
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)

Test accuracy: 1.0
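The perfect score reflects the target leakage, not model quality: `high_income` is computed directly from `income_numeric`, so any model that recovers the $50,000 threshold scores 100%. A minimal sketch of why the label is fully determined by the leaked feature:

```python
import numpy as np

incomes = np.array([12500, 37500, 75000, 125000, 175000])
label = (incomes >= 50000).astype(int)  # exactly how high_income was built
# Given income_numeric, the label reduces to a single threshold check.
print(label.tolist())  # [0, 0, 1, 1, 1]
```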
Validate the data provided on GitHub lines up with the article by recreating a 3rd visual from the article.
type your results and analysis here
# Include and execute your code here

Create a new column that converts the location groupings to a single number. Drop the location categorical column.
type your results and analysis here
# Include and execute your code here
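As a hedged sketch for this item (not the author's solution): one simple way to convert the `region` column to a single number is `pandas.factorize`, which assigns an integer code per distinct label and -1 to missing values:

```python
import pandas as pd

regions = pd.Series(["South Atlantic", "West South Central",
                     "South Atlantic", None])
codes, labels = pd.factorize(regions)
print(codes.tolist())  # [0, 1, 0, -1]
print(list(labels))    # ['South Atlantic', 'West South Central']
```

Since census regions are nominal rather than ordered, an integer code is arbitrary; one-hot encoding them (as `get_dummies` already does above) is usually safer for a model that treats numbers as ordered.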